    Efficient estimation algorithms for large and complex data sets

    The recent world-wide surge in available data allows the investigation of many new and sophisticated questions that were inconceivable just a few years ago. However, two types of data sets often complicate the subsequent analysis: data that is simple in structure but large in size, and data that is small in size but complex in structure. These two kinds of problems also apply to biological data. For example, data sets acquired from family studies, where the data can be visualized as pedigrees, are small in size but, because of the dependencies within families, complex in structure. By comparison, next-generation sequencing data, such as data from chromatin immunoprecipitation followed by deep sequencing (ChIP-Seq), is simple in structure but large in size. Even though the available computational power is increasing steadily, it often cannot keep up with the massive amounts of new data being acquired. In these situations, ordinary methods are no longer applicable or scale badly with increasing sample size. The challenge today is therefore to adapt common algorithms to modern data sets. This dissertation considers the challenge of performing inference on modern data sets and approaches the problem in two parts: first with a problem from the field of genetics, and then with one from molecular biology. In the first part, we focus on data of a complex nature. Specifically, we analyze data from a family study on colorectal cancer (CRC). To model familial clusters of increased cancer risk, we assume an inheritable but latent risk factor that increases the hazard rate for the occurrence of CRC. During parameter estimation, the inheritability of this latent variable necessitates a marginalization of the likelihood that is time-consuming for large families. We first approached this problem by implementing computational accelerations that reduced the time for an optimization by the Nelder-Mead method to about 10% of that of a naive implementation. Next, we developed an expectation-maximization (EM) algorithm that works on data obtained from pedigrees. To achieve this, we used factor graphs to factorize the likelihood into a product of “local” functions, which enabled us to apply the sum-product algorithm in the E-step, reducing the computational complexity from exponential to linear. Our algorithm thus enables parameter estimation for family studies in a feasible amount of time. In the second part, we turn to ChIP-Seq data. Previously, practitioners had to assemble a set of tools based on different statistical assumptions and dedicated to specific applications, such as calling protein occupancy peaks or testing for differential occupancy between experimental conditions. To remove these restrictions and create a unified framework for ChIP-Seq analysis, we developed GenoGAM (Genome-wide Generalized Additive Model), which extends generalized additive models to work efficiently on data spread over a long x-axis by reducing the scaling from cubic to linear and by employing a data-parallelism strategy. Our software makes the well-established and flexible GAM framework available for a number of genomic applications. Furthermore, the statistical framework allows for significance testing for differential occupancy. In conclusion, I show how developing algorithms of lower complexity can open the door to analyses that were previously intractable. 
On this basis, it is recommended that subsequent research efforts focus on lowering the complexity of existing algorithms and on designing new, lower-complexity algorithms.
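
    To make the data-parallelism idea concrete, below is a minimal sketch of tile-based fitting over a long genomic axis. It is not GenoGAM's implementation: a simple moving-average smoother stands in for the penalized spline fit, and the tile and overlap sizes are illustrative.

```python
# Sketch of tiled fitting (illustrative only): fit overlapping tiles
# independently, then keep each tile's center so that boundary artifacts
# of the local fits are discarded.
import numpy as np

def smooth(y, width=101):
    # Stand-in for a per-tile model fit; GenoGAM would fit a penalized
    # spline here, but any local smoother illustrates the strategy.
    kernel = np.ones(width) / width
    return np.convolve(y, kernel, mode="same")

def fit_tiled(y, tile=10_000, overlap=1_000):
    out = np.empty(len(y))
    for start in range(0, len(y), tile):             # independent per tile
        end = min(len(y), start + tile)
        lo, hi = max(0, start - overlap), min(len(y), end + overlap)
        local = smooth(y[lo:hi])                     # fit tile plus margins
        out[start:end] = local[start - lo:end - lo]  # keep the center only
    return out

coverage = np.random.default_rng(1).poisson(5, 100_000).astype(float)
fitted = fit_tiled(coverage)
```

    Because each tile is fitted independently, the loop parallelizes trivially across cores or cluster nodes, which is what makes genome-wide fitting with otherwise cubic-scaling models tractable.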

    A cortical hierarchy for differential GABAergic circuit motifs as targets for stress-induced alterations in endocannabinoid signaling

    Aberrant endocannabinoid (eCB) signaling has been implicated in the pathophysiology of different stress-related psychiatric diseases, such as schizophrenia, depression, and anxiety disorders. Such eCB signaling is triggered in postsynaptic neurons and serves to transiently suppress presynaptic neurotransmitter release by activating presynaptic cannabinoid type-1 (CB1) receptors. CB1 receptors are most abundantly expressed in a subpopulation of cortical GABAergic interneurons (INs), which fine-tune cortical information flow by exerting inhibitory control over excitatory networks. Thus, CB1 receptor-expressing (CB1+) INs may represent a substrate linking changes in eCB signaling with stress-induced disease states. However, their function remains poorly understood. I therefore characterized the properties of CB1+ INs in the neocortex and compared them with parvalbumin-expressing (PV+) INs, a well-characterized IN type with an established role in network refinement. To this end, I used a combination of fluorescence imaging, patch-clamp electrophysiology, and pharmacology in double reporter mice, in which CB1+ and PV+ INs are genetically labeled with tdTomato and YFP, respectively. I found that cortical hierarchy strongly shaped the expression of inhibitory circuit motifs made by the two IN types. Specifically, CB1+ INs were considerably less abundant and made considerably fewer GABAergic synapses onto glutamatergic pyramidal neurons than PV+ INs in the primary somatosensory cortex, a representative cortical region for primary sensory processing. In contrast, abundance and inhibitory connectivity were largely balanced between the two IN types in the prefrontal cortex, a higher-order associative cortical structure that serves an important function in cognitive control and stress regulation. I further characterized the inhibitory circuit properties of the two IN types in the prefrontal cortex across development and assessed their vulnerability to glucocorticoid-mediated developmental stress. To this end, mice were chronically treated with the stress hormone corticosterone during adolescence, a developmental period of heightened stress vulnerability. I found that GABAergic synapses made by the two IN types onto prefrontal pyramidal neurons reached functional maturity early, before the onset of the adolescent phase. Only their glutamatergic inputs seemed to undergo some form of synapse pruning during adolescence, but these developmental changes were mostly restricted to PV+ INs. Remarkably, GABAergic synapses made by both prefrontal PV+ and CB1+ INs were highly resistant to chronic corticosterone treatment during adolescence, even though the treatment induced a prominent stress- and anxiety-related phenotype. Indeed, basal synaptic transmission was completely preserved at both types of synapses in mice chronically treated with corticosterone. However, GABAergic synapses made by CB1+ INs displayed a deficiency in depolarization-induced suppression of inhibition, an eCB-dependent form of short-term synaptic plasticity. These results thus indicate a specific corticosterone-induced deficit in eCB-mediated regulation of synaptic transmission at these synapses. Together, these findings suggest an important role of cortical hierarchy in shaping inhibitory circuit motifs and point to an increased importance of CB1+ INs in regulating excitatory network activity in higher-order cortical structures, such as the prefrontal cortex. 
Moreover, my results highlight a specific dysfunction in eCB-dependent plasticity at synapses of prefrontal CB1+ INs following chronic glucocorticoid exposure. Thus, CB1+ INs might represent a promising new target for the study of stress-related psychiatric disorders. Future studies should determine to what extent the observed deficit in eCB-dependent plasticity is causally related to stress-induced disease states.

    Efficient Maximum Likelihood Estimation for Pedigree Data with the Sum-Product Algorithm

    In this paper, we analyze data sets consisting of pedigrees where the response is the age at onset of colorectal cancer (CRC). The occurrence of familial clusters of CRC suggests the existence of a latent, inheritable risk factor. We aimed to compute the probability of a family possessing this risk factor, as well as the hazard rate increase for these risk factor carriers. Due to the inheritability of this risk factor, the estimation necessitates a costly marginalization of the likelihood. We therefore developed an EM algorithm, applying factor graphs and the sum-product algorithm in the E-step and thereby reducing the computational complexity from exponential to linear in the number of family members. Our algorithm is as precise as a direct likelihood maximization in both a simulation study and a real family study on CRC risk. For 250 simulated families of sizes 19 and 21, the runtime of our algorithm is faster by factors of 4 and 29, respectively. On the largest family (23 members) in the real data, our algorithm is 6 times faster. We introduce a flexible and runtime-efficient tool for statistical inference in biomedical event data that opens the door for advanced analyses of pedigree data.
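
    To illustrate why the factorization pays off, here is a minimal sketch comparing brute-force marginalization, which is exponential in the number of family members, with sum-product message passing, which is linear. It uses a toy chain of binary latent carrier states rather than a real pedigree; the prior, transition matrix, and evidence terms are all illustrative.

```python
# Marginal likelihood of a toy chain z_1 -> ... -> z_n of binary latent
# carrier states with per-person evidence e_i(z_i): brute force sums over
# all 2^n configurations, sum-product passes one message along the chain.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 12
prior = np.array([0.95, 0.05])            # P(z_1): non-carrier, carrier
T = np.array([[0.9, 0.1],                 # P(z_child | z_parent)
              [0.4, 0.6]])
evidence = rng.uniform(0.1, 1.0, (n, 2))  # e_i(z_i), e.g. hazard terms

def brute_force():                        # O(2^n)
    total = 0.0
    for z in itertools.product([0, 1], repeat=n):
        p = prior[z[0]] * evidence[0, z[0]]
        for i in range(1, n):
            p *= T[z[i - 1], z[i]] * evidence[i, z[i]]
        total += p
    return total

def sum_product():                        # O(n)
    msg = prior * evidence[0]
    for i in range(1, n):
        msg = (msg @ T) * evidence[i]
    return msg.sum()

print(brute_force(), sum_product())       # identical up to rounding
```

    On real pedigrees the messages flow over a tree-shaped factor graph rather than a chain, but the principle, and the exponential-to-linear reduction, is the same.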

    Numerical study of the statistical characteristics of the mixing processes in rivers

    A detailed analysis of the statistical characteristics of the vertical mixing process in a horizontally homogeneous and stationary river flow is given. Stochastic models of Langevin type and random displacement models are developed to calculate the statistical characteristics of the vertical mixing. For validation, they are compared with Langevin-type models and random displacement models conventionally applied in this field. All the methods show good qualitative agreement. However, the random displacement model with constant coefficients is shown to exhibit considerable deviations.
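
    As a concrete illustration of the random displacement approach, here is a minimal sketch with illustrative parameters and a parabolic eddy-diffusivity profile (a common choice for open-channel flow); it is not one of the models developed in the paper. Particles follow dz = K'(z) dt + sqrt(2 K(z)) dW with reflection at the bed and the surface.

```python
# Random displacement model: the drift term K'(z) is required for the
# well-mixed condition, i.e. a uniform particle distribution over depth
# must be stationary. All parameters below are illustrative.
import numpy as np

H = 1.0                                   # water depth [m]
kappa, u_star = 0.41, 0.05                # von Karman const., shear velocity

def K(z):                                 # parabolic eddy diffusivity
    return kappa * u_star * z * (1.0 - z / H)

def dKdz(z):
    return kappa * u_star * (1.0 - 2.0 * z / H)

rng = np.random.default_rng(0)
n, dt, steps = 10_000, 0.05, 10_000
z = np.full(n, 0.5 * H)                   # release all particles mid-depth
for _ in range(steps):
    z += dKdz(z) * dt + np.sqrt(2.0 * np.maximum(K(z), 0.0) * dt) \
         * rng.standard_normal(n)
    z = np.where(z < 0.0, -z, z)          # reflect at the bed
    z = np.where(z > H, 2.0 * H - z, z)   # reflect at the surface
# After long times the histogram of z should be close to uniform on [0, H].
```

    Checking that the long-time particle distribution is uniform is the standard well-mixed-condition test for such models; a random displacement model with constant coefficients ignores the depth dependence of K and therefore reproduces the mixing statistics less faithfully, consistent with the deviations noted above.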

    Efficient Maximum Likelihood Estimation for Pedigree Data with the Sum-Product Algorithm

    OBJECTIVE: We analyze data sets consisting of pedigrees with age at onset of colorectal cancer (CRC) as the phenotype. The occurrence of familial clusters of CRC suggests the existence of a latent, inheritable risk factor. We aimed to compute the probability of a family possessing this risk factor, as well as the hazard rate increase for these risk factor carriers. Due to the inheritability of this risk factor, the estimation necessitates a costly marginalization of the likelihood. METHODS: We propose an improved EM algorithm that applies factor graphs and the sum-product algorithm in the E-step. This reduces the computational complexity from exponential to linear in the number of family members. RESULTS: Our algorithm is as precise as a direct likelihood maximization in both a simulation study and a real family study on CRC risk. For 250 simulated families of sizes 19 and 21, the runtime of our algorithm is faster by factors of 4 and 29, respectively. On the largest family (23 members) in the real data, our algorithm is 6 times faster. CONCLUSION: We introduce a flexible and runtime-efficient tool for statistical inference in biomedical event data with latent variables that opens the door for advanced analyses of pedigree data.
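
    As a complement to the message-passing sketch above, the following minimal sketch shows the EM skeleton the abstract describes, on a deliberately simplified toy model: independent individuals with exponential onset times and no pedigree structure. In the paper, the E-step instead runs the sum-product algorithm over the family factor graph; all numbers here are illustrative.

```python
# Toy EM for a latent binary carrier status: the E-step computes carrier
# posteriors, the M-step re-estimates carrier frequency and hazard rates.
import numpy as np

rng = np.random.default_rng(0)
carrier = rng.random(5_000) < 0.2                            # true status
t = rng.exponential(np.where(carrier, 1 / 0.10, 1 / 0.02))   # onset ages

pi, lam0, lam1 = 0.5, 0.03, 0.08   # initial guesses (must be distinct)
for _ in range(200):
    # E-step: posterior carrier probability for each individual
    f1 = pi * lam1 * np.exp(-lam1 * t)
    f0 = (1 - pi) * lam0 * np.exp(-lam0 * t)
    p = f1 / (f1 + f0)
    # M-step: weighted maximum-likelihood updates
    pi = p.mean()
    lam1 = p.sum() / (p * t).sum()
    lam0 = (1 - p).sum() / ((1 - p) * t).sum()
print(pi, lam0, lam1)   # converges toward 0.2, 0.02, 0.10
```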

    The Skorokhod embedding problem for inhomogeneous diffusions

    We solve the Skorokhod embedding problem for a class of stochastic processes satisfying an inhomogeneous stochastic differential equation (SDE) of the form $dA_t = \mu(t, A_t)\,dt + \sigma(t, A_t)\,dW_t$. We provide sufficient conditions guaranteeing that for a given probability measure $\nu$ on $\mathbb{R}$ there exists a bounded stopping time $\tau$ and a real $a$ such that the solution $(A_t)$ of the SDE with initial value $a$ satisfies $A_\tau \sim \nu$. We hereby distinguish the cases where $(A_t)$ is a solution of the SDE in the weak or the strong sense. Our construction of embedding stopping times is based on a solution of a fully coupled forward-backward SDE (FBSDE). We use the so-called method of decoupling fields to verify that the FBSDE has a unique solution. Finally, we sketch an algorithm for putting our theoretical construction into practice and illustrate it with a numerical experiment. (39 pages, 2 figures; to appear in Annales de l'Institut Henri Poincaré (B) Probability and Statistics.)
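
    For intuition about what an embedding stopping time achieves, here is a minimal sketch of the classical special case of embedding a centered two-point law in Brownian motion via a first hitting time. This is not the paper's FBSDE-based construction, and the hitting time here is finite almost surely but not bounded, so it only illustrates the embedding property itself.

```python
# Embed nu = p*delta_b + (1-p)*delta_c (with mean a) in Brownian motion:
# start W at a = p*b + (1-p)*c and stop at the first exit from (b, c).
# Optional stopping gives P(W_tau = b) = (c - a)/(c - b) = p.
import numpy as np

b, c, p = -1.0, 2.0, 2.0 / 3.0
a = p * b + (1 - p) * c                   # = 0 for these values

rng = np.random.default_rng(0)
n, dt = 20_000, 1e-3
w = np.full(n, a)
active = np.ones(n, dtype=bool)
while active.any():                       # simulate until every path exits
    w[active] += np.sqrt(dt) * rng.standard_normal(active.sum())
    active &= (b < w) & (w < c)
print((w <= b).mean(), "vs target", p)    # empirical hit probability at b
```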

    Investigation of the condensation reactions of mono-, di-, and trisilicic acid using the trimethylsilylation method and 29Si NMR spectroscopy

    The condensation reactions of monosilicic, disilicic, and trisilicic acid in acidic aqueous solution (C_SiO2 = 0.5 m; pH = 2; T = −2 °C), prepared by hydrolysis of tetramethoxysilane, hexamethoxydisiloxane, and octamethoxytrisiloxane, respectively, were studied quantitatively as a function of time. It is shown that monosilicic acid reacts to form higher-molecular-weight silicic acids via a mixture of di-, tri-, tetra-, cyclotetra-, and bicyclohexasilicic acids. In disilicic and trisilicic acid solutions, a partial hydrolysis producing monosilicic and disilicic acid, respectively, proceeds in parallel with the condensation reactions, so that after a short time the same reaction products appear as in the monosilicic acid solution. The hydrolysis and condensation of the low-molecular-weight silicic acids are governed by equilibrium reactions that, after longer reaction times (about 6 h), lead to constant concentration ratios independent of the nature of the starting silicic acid. In all cases, no molecularly uniform condensation products are observed but rather mixtures of silicic acids with different degrees of condensation, in which small amounts of monosilicic, disilicic, and trisilicic acid are still detectable even after 96 h.

    Aus baltischer Geistesarbeit: Reden und Aufsätze [From Baltic Intellectual Work: Speeches and Essays], Vol. 2, No. 7

    Get PDF
    http://tartu.ester.ee/record=b1318718~S1*es